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Abstract 

Testing if a p-dimensional sample, for p > 1, comes from a normal population 
is a fundamental problem in statistics. In this paper we propose a Bayesian test 
of p-variate normality against an alternative hypothesis characterized by a certain 
Dirichlet process mixture model. It is shown that this nonparametric alternative 
satisfies the desirable embedding and predictive matching properties with respect 
to the normal null model. To compute the Bayes factor, an efficient sequential 
importance sampler is is proposed for evaluating the marginal likelihood under the 
nonparametric alternative. Numerical examples show that the proposed test has 
satisfactory discriminatory power when the distribution is not normal, and does 
not tend to over-fit when the distribution is normal. 

Keywords and phrases: Bayes factor; embedding; goodness-of-fit test; impor- 
tance sampling; nonparametric model; predictive matching. 



1 Introduction 



Consider a sample Xi,n = (Xi,...,X„) of n independent observations, Xi G W, dis- 
tributed according to a common probability measure F, i.e., F{A) = Pr(Xj G A) for 
measurable A C W. Many statistical procedures, such as linear regression and analysis 
of variance, assume that the observations are samples from a normal population, so an 
important first step is verification of this assumption. The goal is to test the hypothesis 
Hq: F e ^0, where 

^0 = {F^,. = N(/i, aa') : (/i, a) x T^} (1) 
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is the set of all p-dimensional normal distributions. Here Tp is the set of all p x p 
lower-triangular matrices with positive diagonal elements, so that aa' is the Cholesky 
decomposition of the covariance matrix. A Bayesian treatment of this problem and, more 
generally, of goodness-of-fit tes ting of an arbitrary parametric model, has been recently 



explored in the literature fe.g.. iBerger and Guglielmill200ll : ICarota and Parmigianilll996 



Florens et al.lll996 : Verdinelli and Wassermanlll998 ). Unlike classical goodness-of-fit tests 



(cf. lDasGuptall2008l . Chap. 28), the Bayesian setup requires (i) a completion of the null 
model ([1]) with the assignment of a prior distribution Ho on determined by a possibly 
improper prior ttq on x Tp and the mapping (/x, cr) h- >■ -F}(,o-, and (ii) a specification of 
an alternative model Hi : F & ^\ and a prior Hi on For a non- subjective treatment, 
it is natural not to restrict ^\ to a finite-dimensional parametric family; instead, one 
chooses ^\ to be an infinite-dimensional subset of probability measures on W ^ and IIi a 
probability measure supported on Once the priors IIo and Hi are specified, one can 
report the Bayes factor 



(2) 



as a measure of evidence against iJo when X\.n = xi-n are observed. 

Towards a non-subjective, default choice of priors (7ro,ni), both embedding and pre- 
dictive matching have been proposed as desirable properties. Embedding typically means 
expressing IIi as a mixture J H^^a d7ri{fi, a), where each 11^ gives a nonparametric prior 
distribution over probability measures on MP that form a local alternative to -F/^.o-, and 
TTi, like vTo above, is a possibly improper prior on MP x Tp. The local alternative property 
of HiL. n- can be formalized by identifying Fj,,^„ in ([H) as the ce nter of H^ f^, such as its 



mean (IBerger and Guglielmill200ll: ICarota and Parniigianilll996l ). or some other measure 
of central tendency ( IVerdinelli and Wassermanlll998l ). Beyond this embedding property. 



it is difficult to pursue formal non-subjective requirements in choosing 11^^ because of 
its extreme dimensionality. Instead, extrinsic justifications, such as computational ease 
and attractiveness of 11^,^ purely as a statistical model, are taken into consideration. 

With embedding in place, and the choice of 11^^^ justified by extrinsic means, it 
remains to choose the priors ttq and tti on x Tp. It is here that predictive matching 
plays an important role. Often a default choice of ttq, usually improper, can be obtained 
through formal arg uments, with infer ence on a) under the parametric null model the 
ultimate goal 

f FdIl„, jF) : 



,see 



Ghosh et al.l l2006l . Chapter 5). In light of t he embedding property 



F^^cr, it is tempting to choose vri the same as ttq (ICarota and Parmigiani 



19961 ) so that the elements of x Tp are weighted the same under the null and alternative 
models. IBerger and Guglielmil (120011 ) find this reasoning insufficient and argue that the 
choice TTi = VTo is partially justified if the predictive distribution of a hypothetical sample 
of size TT-min IS the same under the two models, where nmin is the minimal sample size 
needed to obtain a proper posterior for (/i, a) under either model. The intuition is that 
a sample of size rimin is needed to barely identify (//, a), and one should not be able to 
tell the two models apart wi thout additional data. 

For univariate data X^■„. \BeTgeT and Guglielmi (l200lh present two compelling choices 
of n^^CT in the form of Polya tree distributions (jhavinel 1992 . 1994; Mauldin et al. 1992 ) 
which satisfy the embedding property. For either choice, they demonstrate that, for the 
normal parametric model N(/i, cr^), a common ttq = tti = tth, where tth is the right Haar 
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measure given by (ivTi^ (/i, a) = (1/ cr) dfi da satisfies the predictive matching property with 
nmin = 2. This is a non-trivial result that follows from a basic identity in iBerger et al. 



f ll998l ) and utilizes the fact that (//, a) can be identified as a location-scale parameter 
under both Hq and Hi, thus providing further extrinsic j ustification for choosing a com- 
mon prior. The proposals in iBerger and Guglielmil (120011) of fer substantial improvements 
over a simila r construction i n ICarota and Parmigianil (119961 ) based on a Dirichlet process 
distribution ( lFergusonlll973l ) as n^^^- That a Dirichlet process is supported on the set of 
discrete probability measures makes it an unattractive choice as an alternative to N(u, g^) 



and leads to rather undesirab le properties of the Bayes fact o r ([Berger and Guglielmill2001 



Carota and Parmigianil 119961 ) . IVerdinelli and WassermanI ( 1l998l ) use a logistic Gaussian 



process for H^ o-? which does support probability measures with smooth densities, but 
their proposal is difficult to compute with and poses serious challenges to a non-subjective 
treatment of the parametric priors ttq and tti. 

Interestingly, none of the available alternative specifications makes use of Dirich- 
let process mixtures, which are arguably the most attractive nonparametric distribu- 
tions for modeling an unkn own probability measure that admits a density; see, e.g.. 



Miiller and Quintanal (120041 ) and the references therein. Models based on Dirichlet pro- 



cess mixture distributions, particularly those that mix over normal kernels, are easy 



to compute with, often via ef f icient Gibbs sampl i ng rnethod s (lEscobar and WestI Il995 



MacEachern and Miillerl Il998l : iMacEachernl Il998l : iNeall 120001 ) and are known to possess 



optimal, adaptive, nea rly parametric convergen c e rates in various applications, includ 



ing density estimation ( IGhosal et al.lll999L l2000l : iGhosal and van der Vaartll200lL 120071 ). 



These nice convergence properties obtain because such a distribution sits on a space of 
probability measures with infinitely smooth densities, given by mixtures of normal prob- 
ability measures, which offer sharp approximations to any probability measure with an 
arbitrarily smooth density. In contrast, a Polya tree distribution sits on prob ability mea- 
sures with densities that are nowhere differentiable (jChoudhuri et al.l 120051 ) . a property 
that automatically limits such a distribution's ability to concentrate ar ound probabil 



ity measures with smooth densities and leads to inefficient estimation (jCastilld 12008 



van der Vaart and van ZantenI |2008| ). Moreover, one can argue that for testing the fit 



of the normal distribution, probability measures with infinitely smooth densities form a 
more relevant alternative set than those with non-differentiable densities. 

The main difficulty in using the Dirichlet process mixture distribution for Il^^cr appears 
to be the limited embedding capacity of such a distribution. For example, the mean of 
a Dirichlet process mixture of normals is a mixture of normals, and thus can equal „ 
only if o- is exactly a mixture of normals, not just a limit of such mixtures. This limited 
embedding capacity is the price one pays for ensuring that the nonparametric distribution 
concentrates on probability measures with smooth densities; logistic Gaussian processes 
suffer from the same shortcoming. 

In this paper we show that for F^^. = U{^,aa'), a specially designed Dirichlet 
process mixture of normals distribution 11^ „- indeed satisfies the embedding property 
f Fc/n^j o-l-^) = ^fi,cT- In addition to satisfying this technical condition and possessing 
the usual support properties of a Dirichlet process mixture of normals distribution, our 
special construction also offers an intuitive interpretation of II^^o- as an alternative to 



That is, a sample F ~ ^u,a can be described as a random "granulation" of F, 



into a mixture of normal measures, where each component occupies a fraction of the 
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volume of F^^a-, with the volume being negatively correlated with the distance between 
the component's center and the center fi of -F)t,CT- By introducing appropriate latent pa- 
rameters, the mixture representation of F under the alternative translates in to a latent 
clustering of the observations Xi-n sampled from F flEscobar and Westlll995l ). with the 
"extent" of clustering being a key factor in separating F ~ 11^ o- from the null element 
Ffj^^cr- By carefully choosing the model hyperparameters, we show how to perform the 
testing at different levels of separation between Hq and Hi by varying a single scalar 
parameter. We further show that, for p > 1, (/U, a) remains a location-scale parame- 
ter for both F^ o- and II^ o- and, consequently, a common choice of ttq = tti = tth, with 
o?7rj^(/i,(T) = (detcr)-(P+^)/2^/iC?cT giving the dimensional right Haar measure on x Tp, 
ensures that the predictive matching property holds with n^i^ = p + 1 — one observation 
to identify n and an additio nal jo observat i ons to identify a. This last result, which is an 
extension of Theorem 4.1 in iBerger et al.l ( 119981 ) . may be of independent interest. 

The remainder of the paper is organized as follows. In Section |2] we introduce our 
particular class of Dirichlet process mixture model alternatives and derive the aforemen- 
tioned properties. We develop a sequential importance sampling strategy in Section E] 
for estimating the marginal likelihood under this Dirichlet process mixture model frame- 
work. Numerical examples are presented in Section HI and concluding remarks are made 
in Section [51 



2 A Dirichlet process mixture alternative 



2.1 Model specification 

Let §p be the space of p x p symmetric positive definite matrices with all p eigenvalues in 
(0, 1). For scalars ui an d 002 greater than (p — 1)/2, let Be(cc;i,a;2) denote the multivariate 
beta distribution on E>p ( lMuirheadlll982l . Chapter 3.3) having density 



Be{v I Ui,U2) = ap(wi,W2)(dettO'^^-(P+^)/^{det(/p - ^)}^2-(p+i)/2^ 



(3) 



where Ip is the p x p identity matrix and ap{ui,U2 
Tp{z) - -r>(P-l)/^ TIP 



p ...^ y y .^^.....j x.xc....^ c...^ ^py^ „ ^-^ j - Tp{Ul + ^2 ) /Fp (^1 ) Fp (^2 ) , Wlth 

^p(p-i)/4 j-j-p^^ -p^^ — (^i — l)/2) the p-variate gamma function. Write for the 
probability measure on MP x Sp given by the law of {U,V), where V ~ Be{uJi,uj2) and 
U I V ~ N{0, Ip — V). This law is well-defined, since Ip — V & S>p with probability 1. 

Let DP(a, \l/) denote the Dirich let process distr ibution with precision constant a > 
and base measure from above fiFergusoru Il973l ). Recall that ~ DP(a, \E') means 
that for any positive integer k and any measurable partition Bi, . . . , of x Sp, the 
probability vector { \1/ ),..., \l/(5fc)} has a fc-dimensional Dirichlet distribution with 
parameters {a;\l/(i?i), . . . , a\l/(i?fc)}. Given this Dirichlet process distribution, for each 
(/i, a), let DPM^ \E') denote the distribution of the random probability measure 



= J + au,ava') d'^{u,v), where ^'~DP(a,^'); 



(4) 



see 



Lol (119841 ). This distribution is called a Dirichlet process mixture of normals, and it 



will be used in what follows as our local alternative II^ o- for F^^^. The Dirichlet process 



4 



mixture differs from that commonly found in the hterature in its use of the normal- 
beta base measure as opposed to the more conventional normal-inverse Wishart base 



measure f lEscobar and Westlll995l ). The advantage is that this new formulation satisfies 
the desirable embedding property described in Section [U see Theorem [T] below. 

To summarize our model formulation, we have the following hierarchical specifications 
of the null and alternative models: 

Ho : Xi,n I (/i, cr) ~ (/i, cr) ~ tTq (5) 

Hi : Xi., I F^,, ~ F^,,, F^,, | (/i, a) ~ DPM^,,(a, (/i, a) ~ vn. (6) 

In the following subsections we shall show that the embedding and predictive matching 
properties hold for this formulation of the testing problem. We shall also give a clustering- 
based interpretation of Hi as a natural alternative to Hq, discuss the choice of beta 
hyperparameters (ci;i,Ci;2), and give some recommendations on Bayes factor reporting. 



2.2 Null embedding property 

Here we show that II^^o- = DPM^ ^-(a, with (/i, a) in MP x Tp, satisfies the embedding 
property, i.e., the mean of DPMp,^o-(a, ^) is N(yU, era'). The key to the proof is that a 
normal distribution may be written as a convolution of two normals. 

Theorem 1. For any (fi^cr), the mean o/ DPM^^o-(«, ^) is N(/i,o"cr'). 

Proof. Given F^^^r ~ DPM^^CT(a, \E'), let ~ DP(a, \E') be the corresponding random 
mixing distribution in Since E(\E') = \E', Fubini's theorem implies 

E(F^,<t) = j + au,ava') dE{^){u,v) 

= J + au,ava') d'^{u,v) 




N(yU + au, ava') dN{u \ Id — v) > dBe{v \ ui, UJ2) 



= J aa') dBe{v \ uji,U2) 
= N(/x,aa'). 

The next-to-last equality above follows from the well-known Gaussian convolution identity 
/ N(a + bu, s) dN{u \ c,t) = N{a + be, s + btb'). □ 

It is clear from the proof that the multivariate beta distribution is not absolutely 
necessary. In fact, the same proof goes through for any distribution for V supported on 
Sp. So it may be possible to generalize our alternative formulation, though we will not 
investigate this any further here. 



2.3 Predictive matching property 

Theorem [1] establishes the crucial embedding property of the null within our proposed 
alternative. To show that predictive mat ching also holds in our formulation, we first es- 
tablish a generalization of Theorem 4.1 of iBerger et al.l (119981 ) to any arbitrary dimension 
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p > 1 and a nonparametric prior on the underlying distribution F. Note that, for arbi- 
trary p, the right Haar measure tih on MP x Tp is given by d-KnifJ', o") = Yl^=i ^fi da, 
where ajj is the j*^ diagonal element of the matrix a. 



2.3.1 Predictive matching: a general result 

In the following, for any random probability measure F ~ 11 on MP, let mu,n denote 
the average product measure / F^"- dIl{F), where F^" denotes the n-fold product of F. 
Note that mu,n is just the predictive distribution of a hypothetical sample of size n from 
the model F ~ 11 and Xi-n | F ~ F. We shall call a family of probability measures 
{n^ o- : (/i, 0") G X Tp} a location-scale family if there is a random probability measure 
F* on MP such that, for any {fi,a), the random measure F*^^ defined as dF*^^{x) = 
|detcT|-^dF*( a ^(x — /i)) is distributed according to n^,o-. A location-scale family will be 
called rotation-invariant if, for any orthogonal matrix rj, Fq ^, has the same distribution as 
F*. Also, we shall call a location-scale family absolutely continuous if the characterizing 
F* is absolutely continuous with respect to Lebesgue measure with probability 1. 

Theorem 2. Let F ~ 11 5e a random probability measure on M^, with H = J g. dTiH{^, a), 
where {H^ g. : (/i, a) G x Tp} is an absolutely continuous, rotation-invariant, location- 
scale family, and tth is the right Haar measure on M^ x Tp. Then, for any Xi:(p+i) in M^ 
such that Xj = Xj — Xp+i, j E 1 : p, linearly independent, 

mYi,p+i{xv.{p+i)) = Cp^l detx\~P, (7) 

where x is the p x p matrix with columns Xi, . . . ,Xp, and the normalizing constant Cp = 
2'P'k'p /"^ /Tp{p/2) is the volume of the p-dimensional Steifel manifold. 

In the proof below, integrals shall be carried out in the form of exterior products of 
differentials, which we denote as (dfi), etc. This use of exterior products leads to simpler 
change of variable formulas than those offered by traditional Jacobians. The changes of 



variab le used below, and the corresponding exterior products, can be found in iMuirhead 



( 119821 . Chap. 2). 



Proof of Theorem [H Let F* ~ 11* be the random measure that characterizes the abso- 
lutely continuous, rotation-invariant, location-scale family 11^,,-, and let /* denote its 
Radon-Nikodym derivative with respect to Lebesgue measure on M^. By Fubini's theo- 
rem, mn,p+i(xi;(p+i)) equals 



p+i 



{l[{det ar^f\a~\x, - d^lH{^l^ a)] dY[\f^) 

<'^p i=i 

Write /(/*) for the integral over M^ x Tp inside the square brackets in the right-hand side 
of the above display. A change of variable r = implies r ranges over Tp, an = t^^, 
deta = (detr)-\ and (da) = (detr)-(P+i) (dr). Therefore, 
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By the rotation-invariant assumption, the right-hand side above remains unchanged if 
we replace F* with Fq^,, for any orthogonal matrix rj. Let if, with dH{r]) = [j]' drf) / Cp, 
denote the Haar measure on Op, the space of p x p orthogonal matrices. Then we must 
have 



If we let u = rjT, then u ranges over the space Gp of p x p non-singular matrices, det r 
|dett^|, and {du) = Y[^=i'''ir\d,'^){v'di])- Therefore, 



p+i 



I{n = c;' f \llr{u{x,-fi))]\detu\{dfi){du). 

jRPxGp ^ ^ 

Note that (/i, z/) effectively ranges over ]Rp^(p+i), the {p + l)-fold product of W^. Make a 
final change of variable, Zi = v{xi — /i), z G 1 : (p + 1). The inverse transformation is 
given by where x is as in the statement of the theorem 

and, likewise, z is the p x p matrix with columns Zi = Zi — Zp^i. Therefore, the Jacobian 
of this transformation equals | det5|| det xl"*^^"^^ and so 



' / {Y[ nz,)}\ det dz,.,(^p+^) = c;'\ det x\-P, 

jRPX(p+l) l--^-^ J 



since J f*(zi) dzi = 1 with n*-probability 1 for each i G 1 : (p + 1). The claim ([7j) now 
follows immediately since /(/*) is free of /*. □ 

In particular, in the univariate case p = 1, the minimum training sample size is 
nrain = 2 and such a training sample consists of two distinct observations, say, xi and X2- 
Then x is just a number, namely xi — X2, and | det x\ = \xi — X2\- Furthermore, it is easy 
to check from the formula that Ci = 2. Therefore, the predictiy e dens ity for xi.2 is simply 



{2\xi — X2|}"^ which is exactly the result given in lBerger et al.l (Il998l . page 309, equation 
7). Their Theorem 4.1 extends this result to the linear model case. An extension of our 
Theorem [2] to the linear model case is surely possible, but we do not pursue this direction 
any further here. 

The important point is that, in addition to an extension to p > 1 dimensions. The- 
orem [2] allows for a nonparametric prior on the underlying distribution F. The con- 
dition on this nonparametric prior is that, given a draw F^^o- ~ H^^o- will al- 
most su rely admit an absolute l y cont inuous, rotation-invariant, location-scale represen- 



tation. 



Berger and Guglielmil (120011 ) argue that, for the p = 1 case, their Polya tree 
prior satisfies this property. In the next subsection we show that the proposed family 
{DPM^_o.(a, : (/x, a) G x Tp} satisfies the conditions of Theorem [2] for general p. 



2.3.2 Predictive matching for the Dirichlet mixture alternative 

Next we apply Theorem [2] to the particular Dirichlet process mixture model of Section 12711 
to establish the advertised predictive matching property. In particular, it follows imme- 
diately from the definition (jl]) that DPM^.o-(a, is a location-scale family characterized 
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by the random measure F* = J N(n, i;) t^) with \1/ ~ DP(a,\l/). Clearly F* is ab- 

solutely continuous with respect to Lebesgue measure because each N{u,v) is so. The 
following lemma shows that this family is also rotation-invariant. 



Lemma 1. For F* = J M{u,v) d'^{u,v) with \1/ ~ DP(a,\l/) and any rj G Op, both F* 
and Fq^i have the same distribution. 

Proof. For rj G Op, dHirjx \ u,v) = dN{x \ ri'u,7]'vr]) and, therefore. 



Kri' = y v) d^rjiu, V), ~ DP(«, ^^), 

where denotes the law of K,) = (yV^y'Vy) wh en {U,V) ~ But if V 



Be(a;i,a;2), then also K; ~ Be(a;i, a;2) (see lMuirheadlll982l . Exercise 3.22d) and if U \ V 



N(0, /p — V), then f/^ | ~ N(0, Jp — K?). Therefore, by construction of we have 
= and, hence, F* and Fq ^ have the same distribution. □ 

It is straightforward to see that {(F^,o-) : (/i, o") G M^xTp}, where {F) denotes a degen- 
erate distribution at F, is also an absolutely continuous, rotation-invariant, location-scale 
family. This leads to the following predictive matching property. 

Theorem 3. The two models ([5]) and (|6]), with ttq = tti = tth, produce the same predictive 
distribution for any hypothetical sample of size rimin = p+l- That is, models (jS]) and (E]) 
satisfy the predictive matching property. 

Proof. The claim follows from Theorem |2] since both {DPM^,^(a,^) : a) eWx Tp} 
and {(-F}i,o-) : (/^; o") G x Tp} are absolutely continuous, rotation-invariant, location- 
scale families, and the set of all (xi,...,Xp+i) G with singular x matrix forms a 
null set with respect to Lebesgue measure. □ 

2.4 Characterization via latent clusters 



The stick-breaking representation of a Dirichlet process (jSethuramanlll994l ) states that a 
random \1/ ~ DPfa, can be written as 



h=l 



where {Uh, Vh), h > 1, are independently draws from qh = (3h Y[j<hi^ ~ Pi) where 
/i > 1, are independent draws from a univariate Be(l, a) distribution, and again (([/, V)) 
denotes a degenerate distribution at {U, V). The vector qi-^o forms a probabihty vector, 
i.e., g/i > and J2h1h = 1; with probability 1. 

Consequently, given (/i, a), a draw from the local Dirichlet process mixture alternative 
DPM^^o-(a, has a representation of the form 

oo 

F^,a = ^'^^(/^ + ""^h, (TVha'), (9) 

h=l 

with {qh,Uh,Vh), h > 1, as described above. Therefore, given the local alter- 

native Xi,n ~ -F^,cr is equivalent to saying that the Xj's are independently distributed 
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according to N(/i + allh-, aVh-cr') where the his are randomly drawn labels with Pr(/ij = 
h) = Qh- Because the /ij's can have ties with positive probability, the equivalence re- 
lation i_2±j_ji_ai}^_onl^^ the data into clusters of observa- 



tions (IGhosh and Ramamoorthill2003l . Chap. 3), where the Xj's in a cluster are indepen- 
dent N(/i + aU,aVa') observations, with {U,V) ~ \E'. The center of this cluster is at a 
all shift from the center fi of the corresponding null element N{fi,aa') and occupies a 
(det Vy^^ G (0, 1) fraction of the volume of the null element. The magnitude {U'UY^'^ of 
the shift (relative to a) is stochastically inversely related to the volume fraction (det VY^"^, 
as can be seen in the following theorem. 

Theorem 4. // (f/, V) ~ ^, then Cov(f/'f/, det V) < 0. 

Proof. Since U\V ~ N(0, Jp - V), it follows that E{U'U \ V) = trE(f/f/' | V) = tr(Jp - 
V) = p — tiV, where ti A returns the trace of a symmetric matrix A. Then 

Cov([/'[/, det V) = Cov{E([/'f/ | V), det V} = -Cov(tr 1/, det V). (10) 

According to iMuirheadl (llQSl p. 112), the eigenvalues of ~ Be(a;i,a;2) are distri- 



butionally equivalent to the eigenvalues of A{A + B) ^, where A ~ Wish(a;i, Jp) and 
B ~ Wish(ci;2, /p). Since tiV and det are both coo rdinate-wise increasing func tions of 



these eigenvalues, it follows from the main result of iDvkstra and HewettI (119781 . Sec. 5) 



that Cov(tr det \^) > 0. This, along with f lTOjl . completes the proof. □ 

Therefore, given (//, a), the local alternative F^^^ in ([9]) can be seen as local granu- 
lations of a population of fine particles evenly distributed according to N(yU, aa'). The 
local granulations form clusters with bell-shaped curves, each occupying only a fraction 
of the total volume of the population. The further the cluster center is from the original 
H{ii,aa') population center, the smaller the cluster size is likely to be. 

2.5 Choice of beta hyperparameters 

It is well-known that, in ([8]), the degree of clustering, i.e., th e prevalence of ties in the 



labels hi, is controlled by the precision parameter a; see, e.g., iGhosh and Ramamoorthi 



fl2003l . Chap. 3). It follows from our discussion above that the degree of clustering is 
a key determinant of how different a local alternative F ~ DPM^ cr(a, \E') is from F^^^j. 
For this reason, we choose to use a as a tuning parameter that encodes the separation 
between N(/i, aa') and a draw from DPM^^o-(«, \1/). By varying a, we aim to cover a large 
spectrum of separation of the Dirichlet process mixture alternative and the null model. 
Operationally, the testing will be performed at every a in a range (0, oo) and the whole 
range of Bayes factors will be reported, from which the user can choose any summary of 
evidence against the null, su ch as those obtained by maximizing or averaging over a; see 



Berger and Guglielmil (120011 ) for more discussion on the use of such tuning parameters in 
testing. 

For large values of a, each cluster weight qh becomes miniscule. This discourages ties 
among the latent labels and makes ^ a fine discrete approximation of the continuous 
measure ^ . In fact, as a — oo, the random measure ^ collapses to the fixed measure 
\E'. Consequently, given (//, a), the random measure F^^„ in ^ converges to F^^„ = 
N(/i, era'). On the other hand, as a — >■ 0, the random \l/ converges to a random degenerate 
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distribution {{U,V)) and, hence, F^^^ converges to + aU,aVa'), with {U,V) ~ 
Thus, elements of the null model appear as the limit of o- when one turns the a knob to 
zero. The random nature of this limit, however, is unappealing because F^^^j is specifically 
designed to provide an alternative to F^^„ only. This shortcoming goes away if we ensure 
that \1/ also converges to ((0, Ip)) in the limit, so that F^^o- converges to its null counterpart 
For \l/ to converge to ((0,/p)), one needs uji/{uji + 002) — )■ 1 as a — )■ 0. To ensure 
this, we choose 

= ^ + and u2 = ^^ + a^^^'y'. (11) 

This is only one of the many possible ways to accomplish ui/ {U1+U2) — 1 as a — ?■ 0, which 
is essential to make \1/ collapse to ((0, Ip)). In addition to these theoretical justifications, 
the particular choice in f|TT]) is simple and useful from a computational point of view. 

Note also that, for (ci;i,ci;2) in ffTTl) . Ui/{ui + 0^2) — )■ as a — t- 00. Consequently, 
turning the a knob to 00 also results in the null model in the limit. However, the nature 
of this limiting path is quite different from that when a — )■ 0. Indeed, for small non-zero a, 
the random \1/ is close to degenerate. In terms of the stick-breaking representation ([8]), a 
single Qh dominates the rest. For large but finite a, there are no dominating Qhs, and F^^o- 
is made up of small contributions from many normal components. By choosing Ui and 
UJ2 as in flTT]) . the variance parameter aVhCr' of any such component is made increasingly 
tiny as a increases. Consequently, F^.o- is an approximately continuous mixture of many 
narrow normal kernels with an overall shape resembling N{fi,aa'). 



2.6 The precision parameter and Bayes factor reporting 



With the complete formulation of the null and alternative models, we can rewrite the 
Bayes factor in ([2]) as 



B 



/m.xt„ /{nr=i dF{x,)}dDPM^^^{F I a, ^) rfTr^l/i, a) ' 



(12) 



One can either subje ct the B a yes fa ctor itself to a threshold, rejecting Hq if and only if 



B is too small, as in iJeffrevsl (jl96lL page 432), or do the same with the posterior odds 
rB, where r is one's postulated prior odds in favor of Hq. 

The Bayes factor above actually depends on a and {oji,(x)2)- With (ci;i,Ci;2) chosen as 
in (llip . the Dirichlet process mixt ure alternative, and hence th e Bayes factor, is entirely 
determined by the scalar a. As in lBerger and Guglielmil ( 1200 ll ). we recommend carrying 
out the goodness-of-fit test separately for a range of a values covering its range, and pre- 
senting the Bayes factors side by side in the form of a plot. In our exam ples, we conside r 
a range of a values comparable to the interval (n~^,r2^) suggested by lEscobarl (119941 ). 
This plot of B versus a can also be interpreted as the reciprocal marginal likelihood for a 
under the Dirichlet process mixture model formulation. The minimum of this plot, corre- 
sponding to the Type II maximum marginal likelihood estimate of a, gives the maximum 
possible evidence against the null within the given class of alternatives. Alternatively, it 
is possible to combine these a-indexed family of models into a single overarching model 
by incorporating a prior distribution (e.g., gamma) for a. 
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3 Computation 



The computation of the numerator of B^xi-n) in (|T2!) is simple since the posterior under 
the null has a nice form. Computation of the denominator, on the ot her hand, requires 
i ntegr ation over an infinite- dimensional space and is non-trivial; see Kass and Raftervl 
fll995l ). We compute the denominator by introducing latent paramet er va l ues (U ,V):-n 
and applying a variation on the sequential imputation technique of iLiul fll996l ). Our 
method differs from Liu's in that (i) we partially collapse the mixture model in its U 
component, and (ii) we deal with the outer integration over {fj^,cr). 

The behavior of under our mixture specification of Hi can be described as 

n 

Xi..n I {U\ V*)^..^, a} ^ \{H{X, \ ^^ + aUl,aVla'), (13) 

1=1 

where {fi, a) ~ irnif^, cr), {U*, V*)i:n ~ are latent mixing parameters and Si.n is a latent 
vector of labels that tracks which observation is allocated to which mixing component, 
and these three variables are mutually independent. It suffices to restrict the latent labels 
to the space {(si, ■ ■ ■ , s„) e ( 1 : n)" : 5i = l,.Si-i-i < maxfgi-i) + 1, i G 1 : (n — 1)}. From 
the Polya urn representation (IBlackwell and MacQueerull973l ) of a Dirichlet process, the 
distribution of Si:n can be written as 



Pr(5i = 1) = 1, Pr(5,+i = e I Si:,) 



kl{Sl:i) 



a+i ^ e 1 : max(5'i,i) 
^ £ = max(5i;,) + l- 



where ke{Si;i) = \Si;i = i\ counts the number of j E I : i with Sj = i. 

It is possible to integrate out U*.^ from this description, with suitable changes made to 
(IT^ . Let f{xi;n, Si;n, v^.n, /i, 0") denote the resulting joint density of Si.n, V7^„,, yU, cr) 

and, for i e : (n — 1), let f^i{xi+i\xi;i, Si-,,, v^.^, /i, a) denote the associated conditional 
density of Xi+i given S'i:i, V^*„, /i, a). Also let /^i(si+i|xi:i+i, si:i, u^.^, /i, a) denote 
the conditional density of Si+i given (Xi:j+i, Si:i, V^.^, /i, a). It is easy to check that 

max(si:i) 

rX / I ★ \ I /\ \ fci/ I /\ 

Ji+i{Xi+i I Xi:i,Si;i,t;i.„,/i,(T) = — — N(xi+i \ n,aa)+ } — N(xi+i | iii^aiOf) 

a + I ^ a + I 

(14) 

rs /f,\ * \ \ c^^h{si,i)N{xi+i \ fie,aia'^), £ G 1 : max(si;i) 

Ic aN[Xi^i \ iJ,,aa ), « = max(si:jj + 1 



with 



fie = fi + a{Ip - v*^){v''^ + ki{si.,i){Ip - v*^)} ^ Xjl{sj = 
Oia\ = (Tv^{v^ + ki{si,i){Ip - v\)') ^{/p + ki,{si,i){Ip - w^)}^'. 



(16) 



and c = aN(xi+i | /i, aa') + ^^^^^('^i-) ki{si.,i)H{xi+i \ jj^e, aga'/,). 

The marginal density fHiixi^n) of Xi.„, under Hi can be calculated by integrating 
f{xi;n,Si;n,Vi.^,fj,,a) with respect to (si:„, f jf.^, /x, a). While this integral is analytically 
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intractable, one can approximate it by importance sampling Monte Carlo fiLid 12001 
Chap. 2.5). Let (SJ^^, Vj*™, /i™, cr™), m G 1 : M, be independent draws from a joint 
density /inip(si:n, fi.„, /i, ex) on the space of {Si:n,V*.^, fi,a). Then a root-M consistent 
estimate of the marginal likelihood fniixi-.n) is 



1 *^ 
M S 



f{x^..n,S^^,V,*:!;:,^,"',a"') 



m rrm 



(17) 



The efficiency of this approximation depends on how well f\-aip{si;n,v\.^, jji.a) approx- 
imates the conditional density of (S'l:^, /i, ci), given Xi,n = xi:n, under the joint 
density /(xi:„, Si:„, w*.^, /i, a). Below we present an /imp that achieves a fairly good ap- 
proximation. 

Let /imp {si;n, /i, 0") be the joint density of {Sun, V^.^, fi, a) where (/i, a) has density 
/j^p(/i,(j) to be specified later, ~ Be{uJi,uj2) independently of (/i,cr), and Si-^ given 
^i*n = ^i:n ^nd (/i, d) has density IY^'q ff+iis,+i \ Si;,, a) as given in 

This choice can be justified on two accounts. First, the conditional importance density 
of Si+i given {Si;i,Vi.^, ^,a) is precisely the partial conditional density of 5*^+1 given 
(Xi.(j+i), 5*1:4, /i, 0") under /. Second, the partial conditional density under / of V/ 
given {5*4+1 = uiax(5'i:j) + 1 = i, Xi,i+i,V^.i_^, fi, a} is precisely Be(a;i, a;2)- Let /^ and 
(jgnQ|;g f^Y^Q densities of V-^^^ ~ Be{uJi,uj2) and (//, a) ~ tth- It can be shown that 

/(Xl:n,Si:„, a) = /'^'^ (/i, (t) /^ « J 

n-1 

X JJ^ {/i^i(a;j+l I Si;i,vl.^, 



+ I a;i:j+l, Sl:j, Ui.„, /i, c 



i=0 



n-1 



/imp(Sl:n, Wl:„, /i, C^)7 1 1 /' 



/imp(/i, 0-) 



j+1 



(Xj+l I a;i:j, Sl:j, W*.^, /i, 0") 



j=0 



where the first equality follows from the sequential imputation calculations of iLiul ( 1l996l ) 
and the second equality follows from the definition of /imp- Therefore fll7p simplifies to 



1 ^ ff'^''{fi"',a'") " ^ 



E 



]^/i^i(a;i+i I xi:i,s™j,t'i.„,/i,(T) 



1=0 



^„^i/imp(/x-,a™) 



m=l 



where fuoixi-.n) = j YYi=i^i.^i I fJ',<ya')d7rH{fJ',cr) denotes the marginal density of Xi;„ 
under Hq and f^'^{fi, a \ Xi:„) denotes the posterior density of (/i, a) under this model. 

With fHi{xi;n) estimated by fniixi-n) in f|T8|) . an estimate of i? in f fT2|) obtains in 
B = fHo{xi;n)/fHi{xi;n)- Due to ffT^ and ffTSl) . this estimate simplifies to 



B 



1 

M ^ 



m=l 



/imp(/i",cr" 



i=0 



a 



max(s5") 



-1 -1 



h{s^,,) N(x,+i 






a-Fi N(xi+i 




cr'^a"") 



(19) 



12 



with formulas for fi"^ and suitably adapted from f lT6|) . In our implementations, we 
use the approximation in f lT9|) . where for every m, we process the observations xi-n in 
a random order. This extra randomness does not violate the theoretical validity of the 
approximation, instead, makes it practically more efficient. 

Equation f lTI?]) may suggest that f{^{fi,cr) can be chosen to approximate f^^{fi,a \ 
xi;n)- In reality, it should be chosen to approximate fff^ifiyCr \ the posterior den- 
sity of (/i, a) under Hi to make (ITTI) an efficient approximation. However, due to the 
embedding and predictive matching properties of the alternative, the posterior densities 
of (/i, a) under the two models can be expected to be similar to each other. Therefore 
a reasonable choice of fl^ifi^c) seems an approximation to f^'^{fi,a \ with heav- 
ier tails to guard against possible mismatch between this density and f'^'^{fi,a \ 
In the jo = 1 case, there are a number of standard ways this can be done. First, like in 
Berger and Guglielmil ( 2001 ). one can use the sampling distribution of the maximum like- 



lihood estimates (/i, a) to produce a bivariate Student-t density for (/i,log(T) from which 
importance samples can be easily obtained. In the examples presented in Section 14.11 we 
take a slightly simpler approach. Specifically, fl^^^ is the density of (/i, a) where, given 
a, — fi)/a has a Student-t distribution with 3 degrees of freedom, and a has a 

Burr distribution with density function (1 -|- cr/o")"^. The p > 1 case requires samples 
of mean vectors and covariance matrices as opposed to scalars. For this we take /j^p to 
be the multivariate normal-inverse Wishart posterior density under Hq, but rescaled to 
have heavier tails. In particular, we scale the covariance matrix of the normal component 
by a factor of ra, and take the degrees of freedom of the inverse Wishart component to be 
max{p, n — pn^/"^}. This approach is suitable for our general purposes, but further fine- 
tuning would likely lead to improved efficiency. R code is available at the first author's 
website, ,www. stat . duke . edu/~stll8/Sof tware, 



4 Numerical illustrations 



4.1 Univariate case, p =1 



Examp le 1. As a first illustration we revisit the example presented in lBerger and Guglielmi 
( 20011) with observatioiis on the log-lifetimes oi n = 100 Kevlar pressure vessels 
f lAndrews and Herzberg Il985l . page 183). A histogram in Figure [Dj^a) reveals that this 
data set has a significant left-skew, so we are inclined to believe that the underlying 
distribution is non-Gaussian. But we shall evaluate the Bayes factor using the sequential 
importance sampling strategy in Section |3] to confirm our inclination. In particular, for 
13 choices of a, namely a = 2", a G —6 : 6, the marginal likelihood approximation (|T7|) is 
obtained based on a Monte Carlo sample size of M = 20, 000. Figure [1] shows a plot of the 
corresponding Bayes factor, on the log^g scale, as a function of a = log2(a). For values 
off the a grid, we have used smoothing spline interpolation. The figure also displays a 
pointwise 95% confidence interval for the Bayes factor, obtained by bootstrapping the 
importance samples. In this plot, the minimum of \ogiQ{B[xi-n)} over a is approximately 
—5, i.e., min^ B{xi-n) ~ 10~^, showing negligible evidence in favor of the parametric null. 
This value i s comparable to, but a bit sra aller than, the minimum Bayes factor ^ 7 x 10~^ 
obtained in iBerger and Guglielmil (j200l[ ) for their Polya tree alternative. 
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I 1 1 1 1 1 

2 4 6 8 10 

log(lifetime) 

(a) Histogram of log- vessel lifetimes 



-6 -4 -2 2 4 

log2(«) 

(b) logj^Q-Bayes factor versus log2(Q!) 



Figure 1: Results from Example [TJ (a) Histogram with the best fitting normal density 
overlaid; (b) Bayes factor as a function of the Dirichlet process precision parameter a. 
The gray band represents a pointwise 95% bootstrap confidence interval. 



The advantage of the Dirichlet process mixture alternative is that samples from this 
posterior have infinitely smooth densities with probability one. This is in contrast to 
the almost sure nowh e re dif ferentiability of samples from the Polya tree alternative in 
Berger and Guglielmil ( 120011 ). The next example illustrates that the test based on the 
Dirichlet process mixture alternative has greater discriminatory power than that based 
on the Polya tree alternative for models which are (mostly) smooth but non-normal. 

Example 2. Consider three non- normal location-scale family models, namely, a Student- 
t distribution, a skew-normal distribution, and a uniform distribution. The first has 
heavier-than-normal tails, characterized by the degrees of freedom z/, the second has 
a skew, characterized by a shape parameter k, and the third has discontinuities and 
no tails. Here we compare the discriminatory power of the proposed Dirichlet process 
mixture-based test to that of Berger and Guglielmi's Polya tree-based test. 

Computation of the Dirichlet process mixture-based Bayes factors, for this p = 1 case, 
is carried out as described Section [3] above, for a satisfying log2(a) G — 6 : 4. For the 
P olya tree-based test, w e impl ement the fixed-partition (Type 2) version as advocated 



m 



Berger and Gug helmil fl200ll . Equation 2) with the function d{em) = h ^4"^ for scale 
parameter h satisfying log2(/i) G —6 : 4. The Bayes factors reported below are the minima 
over the range of a and h, respectively. 

Table [T] shows the Polya tree and Dirichlet process mixture Bayes factors for ten 
independent, randomly chosen data sets of size n G {50, 100, 200} from each of the 
t^u = 3), SN(k = 10), and Unif(— 0.5, 0.5) models. Here we observe that the Bayes factors 
for the Dirichlet process mixture model are generally smaller than those for the Polya tree 
model — sometimes orders of magnitude smaller — which illustrates the former's ability to 
better discriminate a non-normal true distribution from normal. For the Student-t and 
skew-normal, the Dirichlet process mixture test tends to produ ce smaller Bay es factors, so 
the conclusions one reaches, based on Jeffreys' interpretation fjjeffreyslll96l[ ). are mostly 
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n 


= 50 


n = 


100 






n = 


200 




Model 


PT 


DPM 


PT 


DPM 




PT 




DPM 






0.94 


0.91 


1.01 X 10" 


9.20 X 


10- 


-1 


6.25 X 10- 


-4 


4.72 X 10- 


-3 




0.60 


0.25 


6.91 X 10-" 


4.31 X 


10- 


-7 


5.38 X 10- 


-2 


5.13 X 10- 


-3 




0.27 


0.89 


1.53 X 10-1 


3.96 X 


10- 


-4 


1.24 X 10- 


-3 


4.28 X 10- 


-4 




0.002 


0.0003 


3.30 X 10-6 


4.30 X 


10- 


-9 


3.22 X 10 


-3 


7.43 X 10- 


-4 


4- 


0.98 


0.59 


3.30 X 10-3 


9.44 X 


10- 


-4 


1.44 X 10 


-8 


3.76 X 10- 


-13 


t 


0.13 


0.03 


1.90 X 10^ 


1.44 X 


10 


-4 


8.53 X 10- 


-5 


1.27 X 10- 


-6 




1.01 


0.96 


1.03 X lO'' 


6.40 X 


10- 


-1 


2.96 X 10 


-2 


5.07 X 10- 


-6 




0.54 


0.05 


1.02 X 10-6 


1.93 X 


10- 


-9 


2.56 X 10- 


-13 


9.32 X 10- 


-18 




1.02 


0.27 


4.44 X 10-1 


4.77 X 


10- 


-2 


5.93 X 10- 


-5 


8.29 X 10- 


-8 




0.004 


0.001 


3.86 X 10-1 


6.88 X 


10- 


-7 


1.63 X 10- 


-5 


8.00 X 10- 


-8 




0.94 


0.83 


8.25 X 10-1 


1.32 X 


10- 


-4 


3.79 X 10- 


-5 


3.60 X 10- 


-a 




1.01 


0.06 


6.38 X 10-1 


8.96 X 


10- 


-3 


3.84 X 10- 


-7 


8.62 X 10- 


-11 




0.46 


0.11 


7.68 X 10-3 


7.04 X 


10- 


-4 


6.49 X 10 


-3 


1.02 X 10- 


-5 




0.11 


0.08 


4.40 X 10-2 


5.68 X 


10- 


-3 


3.25 X 10" 


-2 


2.10 X 10- 


-5 


CM 


1.00 
1.01 


0.46 
0.34 


1.27 X 10-2 
2.38 X 10-1 


3.77 X 
3.16 X 


10- 

10- 


-4 
3 


5.20 X 10" 
3.50 X 10 


-3 

-3 


7.59 X 10- 
6.86 X 10- 


-5 
-7 




1.02 


0.45 


6.24 X 10^ 


3.48 X 


10 


3 


4.23 X 10- 


-10 


1.44 X 10- 


-13 




0.38 


0.11 


7.65 X lO-'^ 


7.78 X 


10- 


-9 


3.14 X 10- 


-4 


1.22 X 10" 


-6 




0.67 


0.29 


2.53 X 10-2 


1.07 X 


10- 


-4 


2.30 X 10- 


-2 


2.17 X 10- 


-5 




0.20 


0.25 


4.91 X 10-1 


3.71 X 


10- 


-6 


8.83 X 10- 


-7 


3.40 X 10- 


-10 




0.29 


0.01 


0.04 


1.40 X 


10- 


-7 


5.23 X 10 


-1 


4.53 X 10- 


-6 




0.20 


0.04 


0.51 


3.88 X 


10- 


-3 


1.17 X 10 


-3 


4.63 X 10- 


-9 




1.02 


0.14 


0.25 


1.22 X 


10- 


-2 


1.23 X 10- 


-8 


1.68 X 10- 


-13 




1.01 


0.01 


1.03 


1.17 X 


10- 


-5 


4.40 X 10- 


-2 


5.03 X 10- 


-7 


Unif 


0.62 


0.01 


1.02 


1.02 X 


10- 


-2 


3.17 X 10 


-2 


2.25 X 10- 


-6 


0.83 


0.04 


0.04 


1.28 X 


10- 


-3 


2.58 X 10- 


-2 


3.56 X 10- 


-8 




0.93 


0.18 


0.17 


7.93 X 


10- 


-6 


1.35 X 10- 


-5 


2.89 X 10- 


-8 




0.96 


0.38 


0.76 


1.82 X 


10- 


-2 


2.46 X 10- 


-1 


5.36 X 10- 


-5 




1.01 


0.04 


0.09 


6.92 X 


10- 


-3 


5.61 X 10 


-2 


3.94 X 10- 


-5 




1.01 


0.09 


1.04 


1.02 X 


10- 


-1 


3.62 X 10 


-2 


1.37 X 10- 


-5 



Table 1: Minimum Bayes factors for testing normality against Polya tree and Dirichlet 
process mixture alternatives, respectively. For each model and sample size configuration, 
as described in Example |2l Bayes factors are shown for ten random data sets. 

the same for the two methods for n = 50 and n = 200. However, there are some striking 
differences in the n = 100 case, in particular, those two highlighted in bold. Histograms 
of the two data sets in question are shown in Figure O along with the best-fit normal 
density function overlaid. In both cases, normality is doubtful: the Student-t data in 
panel (a) has some extreme outliers relative to the corresponding normal, and the skew- 
normal data in panel (b) is highly asymmetric. Therefore, the strong evidence against 
the null hypothesis, as indicated by our proposed test, seems more reasonable. 

The results in Table [1] for the uniform distribution are also quite surprising. Seem- 
ingly, the Polya-tree test should be better at detecting the discontinuities of the uniform 
distribution, but our results that show the Dirichlet process mixture test to be much 
more powerful are contrary to this intuition. One possible explanation is that a relative 
small sample from a uniform distribution can be better modeled by a mixture of narrow, 
densely packed smooth Gaussian densities than by some other nowhere smooth density. 
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(a) Data set #6 from t(3) 



(b) Data set #7 from SN(IO) 



Figure 2: Histograms for the data sets corresponding to the two highhghted sets of Bayes 
factors in Table [H along with the best-fit normal density function overlaid. 

In Examples [1] and [21 we see that the proposed testing procedure tends to give little 
support to the null hypothesis of normality. As a last example in this univariate case, 
we shall investigate whether the proposed testing procedure tends to over-fit the data, 
favoring the alternative even when the null is true. 

Example 3. Like in Example El we sample ten data sets of size n G {50, 100, 200}. This 
time, though, the data sets are sampled from N(0, 1), i.e., the null hypothesis is true. 
The minimum Bayes factors for the Polya tree and Dirichlet process mixture alternatives 
are shown in Table |2] over the same range of h and a as in Example IH Here we find 
that the Polya tree test generally has a slightly larger Bayes factor than the Dirichlet 
process mixture test. This is not surprising, because a normal density is much closer to 
looking like a realization from a Dirichlet process mixture model than from a Polya tree 
model. But the fact that the two methods produce Bayes factors near unity indicates 
that neither is drastically over-fitting the data. 

4.2 Multivariate case, p > 1 

Example 4. Here we consider three different simulated data sets of size n = 100 in the 
p = 2 case. The first two models in question are a bivariate normal and a bivariate 
Student-t with three degrees of freedom. The third model, defined through a copula, 
has normal marginals but a non-elliptical joint distribution. Figure El shows scatterplots 
of the three data sets, all on the same scale, along with plots of the proposed Bayes 
factor versus the Dirichlet precision parameter a. In the normal case (top row) we see 
that the Bayes factor never drops below 1, so there would be no reason to doubt the 
normality assumption. The second row shows a Student-t data set where we see a high 
concentration of points around (0, 0) along with a number of potential outliers, suggesting 
a heavier-than-normal tail. Since the Bayes factor for this case bottoms out just below 
10~^, our proposed testing procedure is apparently able to pick up the effect of the heavier 
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n 


= 50 


n = 


100 


n = 


200 


Model 


PT 


DPM 


PT 


DPM 


PT 


DPM 




0.42 


0.97 


1.03 


0.99 


1.06 


0.91 




1.00 


0.68 


1.02 


1.02 


1.07 


0.90 




0.86 


0.75 


1.03 


1.00 


1.05 


0.95 




1.02 


0.92 


0.93 


0.89 


0.97 


0.97 


N(0,1) 


1.01 


0.97 


1.02 


0.91 


1.07 


0.90 


1.02 


0.98 


0.75 


0.84 


1.09 


0.94 




1.02 


0.95 


1.05 


0.94 


1.09 


0.95 




1.02 


0.97 


1.02 


0.91 


1.06 


0.99 




1.03 


0.99 


0.65 


0.91 


0.92 


0.61 




1.02 


0.96 


1.02 


0.93 


1.11 


0.99 



Table 2: Minimum Bayes factors for testing normality against Polya tree and Diriclilet 
process mixture alternatives, respectively. These are shown for ten random data sets 
from a standard normal distribution. 

than expected tail. Finally, based on the sharp decrease in the Bayes factor in the last 
row, it is apparent that the proposed test can easily pick up the non-elliptic shape of the 
underlying distribution. 

Example 5. As a last illustration, we investigate the development of ancient Egyp- 
tian skulls. Our data consists of p = 4 measurements — maximal breadth, basibreg- 
matic height, basialveolar length, and nasal height — taken on n = 150 ancient skulls 
at one of five time periods ran g ing fr om 4000 B.C. to 200 A.D. These data appear in 



Thomson and Randall-Maciverl (119051 ): see also http: //lib . stat . emu . edu/DASL , To 

test for differences in mean skull shape across the five time periods, it is standard to 
perform a multivariate analysis of variance, or MANOVA. The MANOVA results (not 
shown) indicate that there is a significant difference between the mean skull measure- 
ments across time; however, the validity of this procedure assumes the error terms to 
have a multivariate normal distribution. Here we apply the proposed Bayesian method- 
ology to the residuals to assess the assumption of normality. Figure IH^a) shows a quantile 
plot of the observed Mahalanobis distance scores versus the theoretical chi-square null 
distribution. A visual inspection of this plot suggests that there may be some doubt 
about the postulated chi-square model, though apparently not substantial. But this plot 
is a one-dimensional summary of a four- dimensional distribution, so some information 
may be lost. For a more complete summary. Figure 111(b) shows the estimated Bayes 
factor against the Dirichlet precision parameter a. The log^o'-^^y^s factor bottoms out 
around —0.83 which suggests that there is only a small amount of evidence against the 
null hypothesis. However, due the magnitude of the Bayes factor for moderate to large a 
values, any reasonable weighted average over a would again produce a large Bayes factor. 
Therefore, we conclude that there is no significant evidence against the four-dimensional 
normality of the MANOVA residuals. 
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Figure 3: Data scatterplots and plots of the Bayes factor versus a as in Example Hi The 
first row is normal; the second row is Student-t; the third row is non-elliptical. 
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I 



Chi-Square quantiles 
(a) Quantilc plot of Mahalanobis scores 



-2 c 
log2(«) 



(b) log;^Q-Bayes factor vs. logj a 



Figure 4: Two summaries of the evidence for/against the normahty null hypothesis in 
the ancient Egyptian skull problem described in Example O 

5 Concluding remarks 

This paper presents a novel approach to testing p-variate normality, for general p > 1, 
from a Bayesian perspective. In particular, we propose a broad nonparametric alterna- 
tive based on a Dirichlet process mixture prior, and show that the null normal model 
is suitably embedded in this class. A predictive matching property of this new alterna- 
tive class is established, which justifies the use of a common, noninformative prior for 
the nuisance location and scale parameters in both the null and alternative hypothe- 
ses. Consequently, the proposed test requires the user to specify only a single scalar 
tuning parameter. An efficient sequential importance sampler is provided which allows 
for fast evaluation of Bayes factors over a range of tuning parameter values. Simulation 
results demonstrate that the proposed method has higher discriminatory power when the 
true, data-generating distribution is a smooth departure from normality, and also avoids 
over-fitting when the true distribution is normal. 

An important and challenging open question is if large-sample consistency of the re- 
sulting Bayes factor obtains. For this, one first needs a posterior consistency result for the 
proposed nonpara metric alternative mod el. The authors believe that an argument along 



the fines given in IWu and Ghosall ( l2010l ) can be used to establish posterio r consistency, 



but th is is only half the battle for Bayes factor consistency. As mentioned in lTokdar et al. 



(120 lOl ). because the nonparametric model can recover the a true n ormal model nearly as 



efficiently as a parametric model, special conditions are needed (see lMcVinish et al.ll2009l ) 
to prove Bayes factor consistency under i^o? and it is not yet clear if even standard priors 
satisfy these conditions. 
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